CSE 373: Data Structures and Algorithms

Winter 2000

Programming Assignment 2 Due February 11, 2000

 

Please include your name and student ID # on what you turn in.  What you turn in must include a listing of the program, the input data, and the output data.  For the data files you must turn in a hex dump of the file.  As the deadline nears we will provide you with a set of sample input data.

 

Your assignment is to implement a compression/decompression program for files.  This is a two-part program: a compression program and a decompression program.  The compression program takes as input a data file and outputs a smaller second file while maintaining data fidelity.  The decompression program takes as input the compressed file and outputs the original uncompressed contents.  Think of the compression program as making freeze-dried coffee while the decompression program puts the water back in.

 

The basic concept behind data compression:

 

Data compression is essentially finding a way to store redundant data in less space.  For example, consider the following data:

 

            ABCDEFGHIJBCDEFGZ

 

Notice that the sub-string BCDEFG is repeated and its second occurrence takes up 6 bytes of data.  Now, if we replace that second occurrence of BCDEFG with information about what to repeat, and we can do it in fewer than 6 bytes, then we can compress the data.  We express this information in terms of how many characters to back up and how much data to copy.  In our example we back up 9 characters and copy 6 characters.  So the compressed data might look like

 

            ABCDEFGHIJ[9,6]Z

 

Where we’ve replaced the second occurrence of BCDEFG with “[9,6]”.  The more redundancy we can find and replace, the more we can compress.

 

Consider a second example,

 

            AUGGGGGGGGGGGGGGGGGGH

 

In this example the character G occurs 18 times in a row.  The compressed form for this string is:

 

            AUG[1,17]H

 

This indicates we are to back up 1 character and copy 17 bytes.  Note that the copy overlaps itself: each copied byte becomes available as a source for the bytes that follow it, which is how backing up just 1 character can repeat a single character 17 times.

 

Now the output stream is made up of what are called “tokens”.  There are two types of tokens: “data tokens” and “copy tokens”.  Data tokens correspond to the input characters that appear unchanged in the output stream.  In the preceding example, ‘A’, ‘U’, ‘G’ and ‘H’ are data tokens.  Copy tokens correspond to the copy actions necessary to reconstitute the original input file.  In the preceding example, ‘[1,17]’ is a copy token.

 

So the basic idea behind compression is to identify and replace redundancies with copy tokens.

 

The actual compression data format:

 

The compression format you’ll be implementing is a variation of the LZ77 family of formats, in particular the format called LZNT1.  A slightly different variation is used extensively in Windows 2000 for compressed data storage on NTFS.

 

Recall that every ASCII character occupies a byte and a byte is 8 bits.  In the compressed format, data tokens occupy 1 byte each and copy tokens occupy 2 bytes each.  An extra bit is also needed to indicate the type of each token.  To do this we’ll precede every 8 tokens of output with a token control byte.  Each bit in the control byte corresponds to a token: the first bit corresponds to the first token, and so forth.  If the control bit is a 0 then the token is a data token.  If the control bit is a 1 then the token is a copy token.  Bits in a control byte are numbered from right to left (least significant to most significant bit).  Every control byte can therefore be followed by as few as 8 bytes and at most 16 bytes, corresponding to all data tokens or all copy tokens respectively.  The first byte of a copy token is the number of characters to back up and the second byte is the number of characters to copy.  So consider our earlier example

 

            ABCDEFGHIJ[9,6]Z

 

The output for this file would be:

 

            0x00 ‘A’ ‘B’ ‘C’ ‘D’ ‘E’ ‘F’ ‘G’ ‘H’ 0x04 ‘I’ ‘J’ 0x09 0x06 ‘Z’

 

where 0x00 is a control byte of all zeros, 0x04 (binary 00000100) is a control byte with a 1 in the third bit position, and “0x09 0x06” is a copy token meaning go back 9 bytes and copy 6 bytes.
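
For concreteness, here is that same compressed stream written out as a C byte array (just a sketch you might use as test data; the array name is ours):

unsigned char Example[] = {
    0x00,                          /* control byte: next 8 tokens are all data tokens */
    'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H',
    0x04,                          /* control byte 00000100: the third token is a copy token */
    'I', 'J',                      /* two data tokens */
    0x09, 0x06,                    /* copy token: back up 9, copy 6 */
    'Z'                            /* one more data token */
};                                 /* 15 bytes, versus 17 bytes of original data */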

 

Because a copy token uses only one byte each for distance and length, it can go back at most 255 bytes and copy at most 255 bytes.  So if the redundancy is further back in the file than 255 characters, a copy token cannot be used.  Also, if the redundancy is longer than 255 characters, we need multiple copy tokens.  For example, if the data file is the character A repeated 1000 times, then the compressed data would look like

 

            0x1E ‘A’ 0x01 0xFF 0x01 0xFF 0x01 0xFF 0x01 0xEA

 

where 0x1E (binary 00011110) indicates 1 data token followed by 4 copy tokens, “0x01 0xFF” is a copy token to go back 1 byte and copy 255 bytes, and “0x01 0xEA” is a copy token to go back 1 byte and copy 234 bytes (1 + 3 × 255 + 234 = 1000 bytes reconstructed).

 

Notice that two data tokens take up the same amount of space as one copy token.  So unless the redundancy is three bytes or more it is not worth replacing it with a copy token.

 

Also note that in the preceding example the size of the original data file is 1000 bytes and the size of the compressed file is 10 bytes.  Knowing when the compressed file ends is very important because that is the only indication you have of when the tokens stop.  A program cannot blindly process tokens based on the control byte; it must also watch for the end of file.

 

Compression:

 

Your compression program is to read uncompressed data from stdin and write compressed data to stdout.  The input data will not be more than 10,000 bytes, so go ahead and pre-allocate your data buffer.  Use a simple hash table to identify redundancies.  The hash table uses a three-byte key and contains offset information within the input stream.  The size of the hash table is 256, and when a collision occurs the new entry erases the old entry (i.e., essentially no collision resolution).  So the compression algorithm is pretty much as follows (in broad generalities):

 

- Keep a current input stream and output stream location index.

- Scan the input stream from beginning to end:

  - Take the next three bytes at the current input stream index and look up the key in the hash table.

    - If you get a hit, compare the data at your current input stream location against the previous occurrence indicated by the hash table entry.

    - If these match and the previous occurrence is less than 256 characters back, figure out how much to copy, output a copy token, update the hash table entry with the more recent input location, and advance the input stream index by the amount just copied.

  - If a match of 3 characters or more is not found, add the three-byte key to the hash table, output one data token, advance one byte in the input stream, and continue with the top of the loop.

 

Your output logic will need to buffer up a control byte and 8 tokens before outputting them, or you can buffer up the entire output stream and output the data all at once.  Be sure to terminate your program correctly: the last group of output tokens will probably not add up to a nice multiple of eight, so you’ll need to check for EOF and flush a partial final group.  (A sketch incorporating this buffering appears after the hash function discussion below.)

 

The hash table should be an array of 256 integers.  Each entry in the table is the input-file index of the most recent three-byte sequence that hashed to that entry.  A good hash function to use is:

 

HashIndex = InputBuffer[CurrentInputIndex];                  /* start with the first byte */

HashIndex = ( (HashIndex << 8) | (HashIndex >> 4) );         /* spread its bits around */

HashIndex = ( HashIndex ^ InputBuffer[CurrentInputIndex+1] ^ (InputBuffer[CurrentInputIndex+2]<<4) ) & 0xff;   /* mix in the next two bytes and keep the low 8 bits */

 

This should give you a pretty good distribution of hash values.  To check whether the hashed value is really a good match, compare the bytes at InputBuffer[HashTable[HashIndex]] with the bytes at InputBuffer[CurrentInputIndex].  If three or more bytes match then you can output a copy token; otherwise output a single data token and start over again.
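
To tie the pieces together, here is a minimal compressor sketch along the lines described above.  This is one way to do it, not the only way: all of the names (FlushGroup, EmitDataToken, EmitCopyToken, Hash) are ours, and for simplicity it buffers the entire input and output rather than writing groups out as it goes.

#include <stdio.h>

#define MAX_INPUT 10000
#define HASH_SIZE 256

static unsigned char InputBuffer[MAX_INPUT];
static unsigned char OutputBuffer[2 * MAX_INPUT];  /* worst case is 9 bytes out per 8 in, so 2x is plenty */
static int HashTable[HASH_SIZE];
static int OutputLength;

/* Pending token group: one control byte plus up to 16 token bytes. */
static unsigned char ControlByte;
static unsigned char GroupBytes[16];
static int GroupLength;
static int TokenCount;

static void FlushGroup(void)                 /* write the control byte, then its tokens */
{
    int i;
    if (TokenCount == 0) return;             /* nothing pending */
    OutputBuffer[OutputLength++] = ControlByte;
    for (i = 0; i < GroupLength; i++)
        OutputBuffer[OutputLength++] = GroupBytes[i];
    ControlByte = 0;
    GroupLength = 0;
    TokenCount = 0;
}

static void EmitDataToken(unsigned char c)
{
    GroupBytes[GroupLength++] = c;           /* control bit stays 0 for a data token */
    if (++TokenCount == 8) FlushGroup();
}

static void EmitCopyToken(int back, int length)
{
    ControlByte |= 1 << TokenCount;          /* mark this token as a copy token */
    GroupBytes[GroupLength++] = (unsigned char)back;
    GroupBytes[GroupLength++] = (unsigned char)length;
    if (++TokenCount == 8) FlushGroup();
}

static int Hash(int i)                       /* the hash function from the handout */
{
    int h = InputBuffer[i];
    h = (h << 8) | (h >> 4);
    return (h ^ InputBuffer[i+1] ^ (InputBuffer[i+2] << 4)) & 0xff;
}

int main(void)
{
    int InputLength = 0, i, c;

    /* Note: on Windows you may need to put stdin/stdout into binary mode. */
    while (InputLength < MAX_INPUT && (c = getchar()) != EOF)
        InputBuffer[InputLength++] = (unsigned char)c;

    for (i = 0; i < HASH_SIZE; i++)
        HashTable[i] = -1;                   /* -1 means "slot never used" */

    i = 0;
    while (i < InputLength) {
        if (i + 2 < InputLength) {           /* need three bytes to hash */
            int h = Hash(i);
            int prev = HashTable[h];
            int match = 0;
            HashTable[h] = i;                /* collisions simply overwrite */
            if (prev >= 0 && i - prev < 256) {
                /* Verify the candidate really matches, and see how far. */
                while (match < 255 && i + match < InputLength &&
                       InputBuffer[prev + match] == InputBuffer[i + match])
                    match++;
            }
            if (match >= 3) {                /* shorter matches aren't worth a copy token */
                EmitCopyToken(i - prev, match);
                i += match;
                continue;
            }
        }
        EmitDataToken(InputBuffer[i++]);
    }
    FlushGroup();                            /* don't forget the partial last group */

    fwrite(OutputBuffer, 1, (size_t)OutputLength, stdout);
    return 0;
}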

 

Decompression:

 

Your decompression program is to read compressed data from stdin and write uncompressed data to stdout.  The generated output stream will not be more than 10,000 characters.  The decompression program does not use a hash table.  The algorithm, again in broad generalities, is essentially:

 

- Keep a current input stream and output stream location index.

- Scan the input stream from beginning to end:

  - Use the control byte to decipher the next 8 tokens (provided we haven’t hit the end of file):

    - For a data token, simply transfer the byte from the input stream to the current output stream index location, and advance the output stream index by one byte.

    - For a copy token, perform the indicated copy operation on the output stream, and advance the output stream index by the number of bytes copied.  Look at the earlier examples to convince yourself how this works.

 

Be sure to terminate your program correctly.  The end of file on the input stream tells you when to stop processing tokens; it probably will not fall on a nice multiple-of-eight-token boundary.

 

Note that strncpy may not work for this case, especially if it tries to solve the problem we highlighted in Assignment #1 exercise #6.  In particular, a copy token can require the source and destination regions to overlap, which standard copy routines are not guaranteed to handle.  Therefore you will probably need to write your own copy routine.
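
Here, likewise, is a minimal decompressor sketch under the same assumptions as the compressor sketch above (our own names, whole-output buffering).  Note that the copy loop moves one byte at a time, which is exactly what makes overlapping copies such as [1,17] work:

#include <stdio.h>

#define MAX_OUTPUT 10000                     /* assignment guarantees at most 10,000 bytes */

static unsigned char Output[MAX_OUTPUT];

int main(void)
{
    int OutputLength = 0;
    int c;

    /* Note: on Windows you may need to put stdin/stdout into binary mode. */
    while ((c = getchar()) != EOF) {
        unsigned char ControlByte = (unsigned char)c;
        int t;
        for (t = 0; t < 8; t++) {            /* up to 8 tokens follow the control byte */
            if (ControlByte & (1 << t)) {    /* copy token: two bytes                  */
                int back, length, i;
                if ((back = getchar()) == EOF) break;
                if ((length = getchar()) == EOF) break;
                /* Copy one byte at a time so the source region may overlap
                   the destination (e.g. back = 1 repeats one character).  */
                for (i = 0; i < length; i++) {
                    Output[OutputLength] = Output[OutputLength - back];
                    OutputLength++;
                }
            } else {                         /* data token: one literal byte           */
                int data = getchar();
                if (data == EOF) break;
                Output[OutputLength++] = (unsigned char)data;
            }
        }
    }
    fwrite(Output, 1, (size_t)OutputLength, stdout);
    return 0;
}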

 

In C an easy way to set or clear a bit in the control byte might be

 

ControlByte |= 1<<(CurrentTokenCount % 8);    // set bit

ControlByte &= ~(1<<(CurrentTokenCount % 8)); // clear bit

 

Where CurrentTokenCount is just a running count of the total number of tokens we’ve produced.  An easy way to test a bit is then

 

if (ControlByte & (1<<(CurrentTokenCount % 8)))

 

Where the “if” is true when the bit is set.

 

What to turn in:

 

On the due date you must turn in the program listings and hex dumps of the original data file, the compressed data file, and the uncompressed data file.  For example, if my two programs are called compress and uncompress, then on Windows NT I would run the following commands to generate and print the hex dumps.

 

C:\> compress < OriginalDataFile > CompressedDataFile

C:\> uncompress < CompressedDataFile > UncompressedDataFile

C:\> hd < OriginalDataFile > OriginalDataFile.hd

C:\> hd < CompressedDataFile > CompressedDataFile.hd

C:\> hd < UncompressedDataFile > UncompressedDataFile.hd

C:\> print compress.c

C:\> print uncompress.c

C:\> print OriginalDataFile.hd

C:\> print CompressedDataFile.hd

C:\> print UncompressedDataFile.hd

 

In Windows 98 this is the “od” program.  In Windows NT there should be a “hd” program.  Please see the instructor if you cannot find a hex dump program.

 

Extra Credit:

 

Implement the same program but change the hash table into a splay tree.  With this change in implementation you will no longer miss redundancies because of hash table collisions.  Run this modified program on the same data and turn it in.  With this second program, is the compressed data any smaller?

 

One last word

 

Here is a final word of advice.  There are more patents on data compression than there are fleas in a kennel, and twice as many attorneys ready to enforce those patents.  Think long and hard before you try to market yet another compression program.